-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sanitization and validation for files #3
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about that comment? I cannot see anything related, is there any reason?
} | ||
|
||
// Validate size | ||
if self.size <= 0 || self.size > MAX_SIZE { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we taking the client declared size
in /pubky.app/file/<object>
at face value? Is there any way we can actually check the size of the blob? Maybe we should have a blob
model with the validation that should be passed after we get the blob response in ingest()
and before store_blob()
in events/handlers/file.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the user has any incentive to lie about the size of the file.
Having said that, it looks close to the content-type issue. Our best check would be to make sure it matches the content-length header returned from the homeserver.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this something we can write into this specs crate? Should be as part of the /blob
validation that probably has to run as a child validation of the /file
one ?
|
||
Self { | ||
name, | ||
created_at: self.created_at, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand why we cannot use hash_id
. We have a created_at
timestamp anyway, therefore it is duplicated information. It's a straightforward method for the user to post 3 times the same .gif
without creating 3 times the file on his homeserver or the indexer.
Alternatively, we have name
field, that could as well be the file and blob name instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the emphasis on hash_id for? I understand its use case for something like tags, but we don't need that here.
The use case you're mentioning is not solved by having the same hash_id. With the current way, a sane user can create one blob, one file and then use that one file anywhere he wants. Should we stop the ability of someone to create multiple blobs and multiple files? I don't think so. I think it's important to remember we're a proxy here and people can do whatever they want with their homeservers, so while we enable, we shouldn't really look into ways of prohibiting some use cases, no matter how fringe they might sound to us, when there's no problem with having them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, not necessarily saying hash_id
should go on the /file
object only, I am talking about the obvious big benefits of hash_id
for the data blob. Are the /blobs
and the way Nexus uses them covered by the specs?
There is no use for the timestamp ids.
Consider what happens when in pubky-app I am a shit poster and I reply to 10 users by dropping into the post editor modal the same animated gif to make fun of them.
people can do whatever they want with their homeservers
This is barely an argument as the specs are created with the explicit intention of restricting the way data is written into homeservers into a common set of rules that allow interoperability between social pubky-app clients and indexers. A user can write whatever data he wants in whatever spec breaking way he wants and simply share a URI.
when there's no problem with having them
I think free storage saving for anyone using pubky-app according to specs is really good for such a small change (just hash_id for blobs). Might seem a stupid optimization this early but changing IDs and schemas post-launch will be harder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For blob ids it does make sense to have hash_ids.
// validate content type | ||
match Mime::from_str(&self.content_type) { | ||
Ok(mime) => { | ||
if !VALID_MIME_TYPES.contains(&mime.essence_str()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declared mime type could be pdf
and blob be a exe
, right? On click on the browser from pubky-app, the user will download a .exe
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. The user doesn't really have an incentive to lie about this.
Ultimately of course, we can have some security checker that runs the files through some malware detector, but mind you, something like this needs to be implemented on the homeserver first. The same goes for content type and sniffing the actual file type of the blob.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The user doesn't really have an incentive to lie about this.
Can you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm saying for normal users there's no need to lie about the content type of their files.
For malicious intent, you don't have to lie about the content type. IMO, If there are to be checks for this, it should be on the homeserver, not here.
src/file.rs
Outdated
const MIN_NAME_LENGTH: usize = 1; | ||
const MAX_NAME_LENGTH: usize = 255; | ||
const MAX_SRC_LENGTH: usize = 1024; | ||
const MAX_SIZE: i64 = 10_000_000; // 10 MB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we using decimal MB?
To avoid confusion maybe best to do 2^20 B
const MAX_SIZE: i64 = 10 * (1 << 20);
or 10_485_760
No description provided.